Abstract

In batch learning, stability together with existence and uniqueness of the solution corresponds
to well-posedness of Empirical Risk Minimization (ERM) methods; recently, it was proved
that $CV_{loo}$ stability is necessary and sufficient for generalization and consistency of ERM
([9]). In this note, we introduce $CV_{on}$ stability, which plays a similar role in online learning.
We show that stochastic gradient descent (SGD) with the usual hypotheses is $CV_{on}$ stable
and we then discuss the implications of $CV_{on}$ stability for convergence of SGD.
1 Learning, Generalization and Stability

In this section we collect some basic definitions and facts.
1.1 Basic Setting 

Let $Z$ be a probability space with a measure $\rho$. A training set $S_n$ is an i.i.d. sample $z_i$,
$i = 0, \dots, n-1$, from $\rho$. Assume that a hypotheses space $\mathcal{H}$ is given. We typically assume
$\mathcal{H}$ to be a Hilbert space and sometimes a $p$-dimensional Hilbert space, in which case, without
loss of generality, we identify elements in $\mathcal{H}$ with $p$-dimensional vectors and $\mathcal{H}$ with $\mathbb{R}^p$.
A loss function is a map $V : \mathcal{H} \times Z \to \mathbb{R}_+$. Moreover we assume that

$$I(f) = E_z V(f, z)$$

exists and is finite for $f \in \mathcal{H}$. We consider the problem of finding a minimum of $I(f)$ in $\mathcal{H}$. In
particular, we restrict ourselves to finding a minimizer of $I(f)$ in a closed subset $K$ of $\mathcal{H}$ (note
that we can of course have $K = \mathcal{H}$). We denote this minimizer by $f_K$, so that

$$I(f_K) = \min_{f \in K} I(f).$$

Note that in general, existence (and uniqueness) of a minimizer is not guaranteed unless some
further assumptions are specified.

Example 1. An example of the above setting is supervised learning. In this case $X$ is usually a subset
of $\mathbb{R}^d$ and $Y = [0, 1]$. There is a Borel probability measure $\rho$ on $Z = X \times Y$, and $S_n$ is an i.i.d.
sample $z_i = (x_i, y_i)$, $i = 0, \dots, n-1$, from $\rho$. The hypotheses space $\mathcal{H}$ is a space of functions from
$X$ to $Y$ and a typical example of a loss function is the square loss $(y - f(x))^2$.

1.2 Batch and Online Learning Algorithms 

A batch learning algorithm $A$ maps a training set to a function in the hypotheses space, that is

$$f_n = A(S_n) \in \mathcal{H},$$

and is typically assumed to be symmetric, that is, invariant to permutations of the training set.
An online learning algorithm is defined recursively as $f_0 = 0$ and

$$f_{n+1} = A(f_n, z_n).$$

A weaker notion of an online algorithm is $f_0 = 0$ and $f_{n+1} = A(f_n, S_{n+1})$. The former definition
gives a memory-less algorithm, while the latter keeps memory of the past (see [5]). Clearly, the
algorithm obtained from either of these two procedures will not in general be symmetric.

Example 2 (ERM). The prototype example of a batch learning algorithm is empirical risk
minimization, defined by the variational problem

$$\min_{f \in \mathcal{H}} I_n(f),$$

where $I_n(f) = E_n V(f, z)$, $E_n$ being the empirical average on the sample, and $\mathcal{H}$ is typically assumed
to be a proper, closed subset of $\mathbb{R}^p$, for example a ball or the convex hull of some given finite set
of vectors.
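As a minimal sketch of Example 2, the snippet below solves ERM for the square loss in the simplest case we could pick for illustration: one-dimensional linear hypotheses $f \in \mathbb{R}$ acting as $x \mapsto f x$, constrained to an interval $[-r, r]$. The function names, the interval constraint and the toy sample are assumptions made for this example and do not come from the text.

```python
def empirical_risk(f, samples):
    """I_n(f): average of V(f, z_i) with V(f, (x, y)) = (y - f * x)^2."""
    return sum((y - f * x) ** 2 for x, y in samples) / len(samples)


def erm_on_interval(samples, radius):
    """Minimize I_n over the closed interval [-radius, radius] (hypothetical setup)."""
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    if sxx == 0.0:
        return 0.0  # every f has the same empirical risk; pick 0
    unconstrained = sxy / sxx
    # I_n is a convex parabola in f, so the constrained minimizer is the
    # unconstrained one clipped to the interval.
    return max(-radius, min(radius, unconstrained))


samples = [(1.0, 0.5), (2.0, 1.1), (-1.0, -0.4)]
f_n = erm_on_interval(samples, radius=1.0)
print(f_n, empirical_risk(f_n, samples))
```

Because the minimizer depends on the sample only through sums over the points, reordering `samples` returns the same `f_n` up to rounding, which is the symmetry property mentioned above for batch algorithms.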

Example 3 (SGD). The prototype example of an online learning algorithm is stochastic gradient
descent, defined by the recursion

$$f_{n+1} = \Pi_K(f_n - \gamma_n \nabla V(f_n, z_n)), \tag{1}$$

where $z_n$ is fixed, $\nabla V(f_n, z_n)$ is the gradient of the loss with respect to $f$ at $f_n$, and $\gamma_n$ is a
suitable decreasing sequence. Here $K$ is assumed to be a closed subset of $\mathcal{H}$ and $\Pi_K : \mathcal{H} \to K$ the
corresponding projection. Note that if $K$ is convex then $\Pi_K$ is a contraction, i.e. $\|\Pi_K\| \le 1$, and
moreover if $K = \mathcal{H}$ then $\Pi_K = I$.
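The recursion (1) can be sketched in a few lines. The example below is illustrative only: it assumes linear hypotheses in $\mathbb{R}^2$, the square loss, $K$ a Euclidean ball, synthetic data, and the step size $\gamma_n = \gamma_0/(n+1)^{0.6}$. None of these choices is prescribed by the text; the step size is just one decreasing sequence among many.

```python
import math
import random


def project_onto_ball(f, radius):
    """Euclidean projection onto K = {f : ||f|| <= radius}."""
    norm = math.sqrt(sum(c * c for c in f))
    if norm <= radius:
        return list(f)
    return [c * radius / norm for c in f]


def square_loss_gradient(f, z):
    """Gradient in f of V(f, z) = (y - <f, x>)^2 for z = (x, y)."""
    x, y = z
    residual = sum(fi * xi for fi, xi in zip(f, x)) - y
    return [2.0 * residual * xi for xi in x]


def projected_sgd(samples, dim, radius, gamma0=0.5, decay=0.6):
    """Run recursion (1) from f_0 = 0 with gamma_n = gamma0 / (n + 1)**decay (assumed schedule)."""
    f = [0.0] * dim
    for n, z in enumerate(samples):
        gamma = gamma0 / (n + 1) ** decay
        grad = square_loss_gradient(f, z)
        step = [fi - gamma * gi for fi, gi in zip(f, grad)]
        f = project_onto_ball(step, radius)
    return f


def make_samples(n, target, noise, seed):
    """Synthetic i.i.d. sample: x uniform on [-1, 1]^p, y = <target, x> + Gaussian noise."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in target]
        y = sum(t * xi for t, xi in zip(target, x)) + noise * rng.gauss(0.0, 1.0)
        samples.append((x, y))
    return samples


target = [0.5, -0.3]  # hypothetical minimizer, chosen inside the unit ball
samples = make_samples(20000, target, noise=0.1, seed=0)
f_hat = projected_sgd(samples, dim=2, radius=1.0)
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_hat, target)))
print(f_hat, error)
```

Each update uses only the current iterate and the current sample, so this is the memory-less form of online algorithm from Section 1.2.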
1.3 Generalization and Consistency

In this section we discuss several ways of formalizing the concept of generalization of a learning
algorithm. We say that an algorithm is weakly consistent if we have convergence of the risks in
probability, that is for all $\epsilon > 0$,

$$\lim_{n \to \infty} P(I(f_n) - I(f_K) > \epsilon) = 0, \tag{2}$$

and that it is strongly consistent if convergence holds almost surely, that is

$$P\left(\lim_{n \to \infty} I(f_n) - I(f_K) = 0\right) = 1.$$

A different notion of consistency, typically considered in statistics, is given by convergence in
expectation

$$\lim_{n \to \infty} E[I(f_n) - I(f_K)] = 0.$$

Note that, in the above equations, probability and expectations are with respect to the sample $S_n$.
We add three remarks. 

Remark 1. A more general requirement than those described above is obtained by replacing $I(f_K)$
by $\inf_{f \in \mathcal{H}} I(f)$. Note that in this latter case no extra assumptions are needed.

Remark 2. A yet more general requirement would be obtained by replacing $I(f_K)$ by $\inf_{f \in \mathcal{F}} I(f)$, $\mathcal{F}$
being the largest space such that $I(f)$ is defined. An algorithm having such a consistency property
is called universal.

Remark 3. We note that, following [?], the convergence (2) corresponds to the definition of
learnability of the class $\mathcal{H}$.

1.3.1 Other Measures of Generalization. 

Note that alternatively one could measure the error with respect to the norm in $\mathcal{H}$, that is
$\|f_n - f_K\|$, for example

$$\lim_{n \to \infty} P(\|f_n - f_K\| > \epsilon) = 0. \tag{3}$$

A different requirement is to have convergence in the form

$$\lim_{n \to \infty} P(|I_n(f_n) - I(f_n)| > \epsilon) = 0. \tag{4}$$

Note that for both the above error measures one can consider different notions of convergence
(almost surely, in expectation) as well as convergence rates, hence finite sample bounds.
For certain algorithms, most notably ERM, under mild assumptions on the loss functions, the
convergence (4) implies weak consistency¹. For general algorithms there is no straightforward
connection between (4) and consistency (2).

Convergence (3) is typically stronger than (2); in particular this can be seen if the loss satisfies
the Lipschitz condition

$$|V(f, z) - V(f', z)| \le L \|f - f'\|, \quad L > 0, \tag{5}$$

for all $f, f' \in \mathcal{H}$ and $z \in Z$, but also for other loss functions which do not satisfy (5), such as the
square loss.

1.4 Stability and Generalization 

Different notions of stability are sufficient to imply consistency results as well as finite sample 
bounds. 

A strong form of stability is uniform stability

$$\sup_{z \in Z} \; \sup_{z_1, \dots, z_n} \; \sup_{z' \in Z} |V(f_n, z) - V(f_{n,z'}, z)| \le \beta_n,$$

where $f_{n,z'}$ is the function returned by an algorithm if we replace the $i$-th point in $S_n$ by $z'$ and $\beta_n$
is a decreasing function of $n$.

Bousquet and Elisseeff prove that the above condition, for algorithms which are symmetric,
gives exponential tail inequalities on $I(f_n) - I_n(f_n)$, meaning that we have $\delta(\epsilon, n) = e^{-C \epsilon^2 n}$ for
some constant $C$ [2]. Furthermore, it was shown in [10] that ERM with a strongly convex loss
function is always uniformly stable. Weaker requirements can be defined by replacing one or more
suprema with expectations or statements in probability; exponential inequalities will in general
be replaced by weaker concentration. A thorough discussion and a list of relevant references can
be found in [6, 7]. Notice that the notion of $CV_{loo}$ stability introduced there is necessary and
sufficient for generalization and consistency of ERM ([9]) in the batch setting of classification and
regression. This is the main motivation for introducing the very similar notion of $CV_{on}$ stability
for the online setting in the next section².

2 Stability and SGD 

Here we focus on online learning, and in particular on SGD, and discuss the role played by the
following definition of stability, which we call $CV_{on}$ stability.

¹ In fact, for ERM

$$\begin{aligned}
P(I(f_n) - I(f_K) > \epsilon) &= P(I(f_n) - I_n(f_n) + I_n(f_n) - I_n(f_K) + I_n(f_K) - I(f_K) > \epsilon) \\
&\le P(I(f_n) - I_n(f_n) > \epsilon/3) + P(I_n(f_n) - I_n(f_K) > \epsilon/3) + P(I_n(f_K) - I(f_K) > \epsilon/3).
\end{aligned}$$

The first term goes to zero because of (4), the second term has probability zero since $f_n$ minimizes $I_n$, and the third
term goes to zero if $V(f_K, z)$ is a well-behaved random variable (for example if the loss is bounded, but also under
weaker moment/tail conditions).

² Thus for the setting of batch classification and regression it is not necessary (S. Shalev-Shwartz, pers. comm.)
to use the framework of [8].
Definition 2.1. We say that an online algorithm is $CV_{on}$ stable with rate $\beta_n$ if for $n > N$ we have

$$-\beta_n \le E_{z_n}[V(f_{n+1}, z_n) - V(f_n, z_n) \mid S_n] \le 0, \tag{6}$$

where $S_n = z_0, \dots, z_{n-1}$ and $\beta_n \ge 0$ goes to zero with $n$.
The definition above is of course equivalent to

$$0 \le E_{z_n}[V(f_n, z_n) - V(f_{n+1}, z_n) \mid S_n] \le \beta_n. \tag{7}$$

In particular, we assume $\mathcal{H}$ to be a $p$-dimensional Hilbert space and $V(\cdot, z)$ to be convex and
twice differentiable in the first argument for all values of $z$. We discuss the stability property of
(1) when $K$ is a closed, convex subset; in particular, we focus on the case when we can drop the
projection so that

$$f_{n+1} = f_n - \gamma_n \nabla V(f_n, z_n). \tag{8}$$

2.1 Setting and Preliminary Facts 

We recall the following standard result; see [1] and references therein for a proof.
Theorem 1. Assume that,

• There exists $f_K \in K$ such that $\nabla I(f_K) = 0$, and for all $f \in \mathcal{H}$, $f \ne f_K$, $\langle f - f_K, \nabla I(f) \rangle > 0$.

• There exists $D > 0$ such that for all $f_n \in \mathcal{H}$,

$$E_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n] \le D(1 + \|f_n - f_K\|^2). \tag{9}$$

Then,

$$P\left(\lim_{n \to \infty} \|f_n - f_K\| = 0\right) = 1.$$

The following result will also be useful.

Lemma 1. Under the same assumptions as Theorem 1, if $f_K$ belongs to the interior of $K$, then
there exists $N > 0$ such that for $n > N$, $f_n \in K$, so that the projections in (1) are not needed and
the $f_n$ are given by $f_{n+1} = f_n - \gamma_n \nabla V(f_n, z_n)$.

2.1.1 Stability of SGD 

Throughout this section we assume that

$$\langle f, H(V(f, z)) f \rangle \ge 0, \qquad \|H(V(f, z))\| \le M < \infty, \tag{10}$$

for any $f \in \mathcal{H}$ and $z \in Z$; $H(V(f, z))$ is the Hessian of $V$.






Theorem 2. Under the same assumptions as Theorem 1, there exists $N$ such that for $n > N$, SGD
satisfies $CV_{on}$ stability with $\beta_n = C \gamma_n$, where $C$ is a universal constant.

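Proof. Note that from Taylor's formula,

$$V(f_{n+1}, z_n) - V(f_n, z_n) = \langle f_{n+1} - f_n, \nabla V(f_n, z_n) \rangle + \tfrac{1}{2} \langle f_{n+1} - f_n, H(V(f, z_n))(f_{n+1} - f_n) \rangle, \tag{11}$$

with $f = \alpha f_n + (1 - \alpha) f_{n+1}$ for some $0 \le \alpha \le 1$. We can use the definition of SGD and Lemma 1 to
show that there exists $N$ such that for $n > N$, $f_{n+1} - f_n = -\gamma_n \nabla V(f_n, z_n)$. Hence, changing signs in (11)
and taking the expectation w.r.t. $z_n$ conditioned on $S_n = z_0, \dots, z_{n-1}$, we get

$$E_{z_n}[V(f_n, z_n) - V(f_{n+1}, z_n) \mid S_n] = \gamma_n E_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n] - \tfrac{1}{2} \gamma_n^2 E_{z_n}[\langle \nabla V(f_n, z_n), H(V(f, z_n)) \nabla V(f_n, z_n) \rangle \mid S_n]. \tag{12}$$

By (10), the Hessian term in (12) lies between $0$ and $\tfrac{M}{2} \gamma_n^2 E_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n]$. The above quantity
is therefore at least $(\gamma_n - \tfrac{M}{2} \gamma_n^2) E_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n]$, which is non-negative once $\gamma_n \le 2/M$, that is,
for $n$ large enough when $\gamma_n \to 0$, and it is at most $\gamma_n E_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n]$. Using (9), and the fact that
$\|f_n - f_K\|$ tends to zero almost surely by Theorem 1, we get

$$E_{z_n}[V(f_n, z_n) - V(f_{n+1}, z_n) \mid S_n] \le \gamma_n D(1 + \|f_n - f_K\|^2) \le C \gamma_n,$$

if $n$ is large enough. □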
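The one-step identity behind this proof can be checked numerically. The sketch below assumes the square loss with linear hypotheses, for which the Hessian in $f$ is the constant matrix $2 x x^T$, so that Taylor's formula (11) is exact. Substituting the update (8) into (11) then gives a decrease of $\gamma \|\nabla V\|^2 - \tfrac{1}{2} \gamma^2 \langle \nabla V, H \nabla V \rangle$ for a single sample. The helper names and the numbers are hypothetical, and the check is done per sample rather than in conditional expectation.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def loss(f, z):
    """Square loss V(f, z) = (y - <f, x>)^2 for z = (x, y)."""
    x, y = z
    return (y - dot(f, x)) ** 2


def gradient(f, z):
    x, y = z
    residual = dot(f, x) - y
    return [2.0 * residual * xi for xi in x]


def hessian_form(z, v):
    """<v, H v> for the square loss, whose Hessian in f is H = 2 x x^T."""
    x, _ = z
    return 2.0 * dot(x, v) ** 2


def one_step_decrease(f, z, gamma):
    """Return V(f, z) - V(f_next, z) computed directly and via the Taylor expansion."""
    g = gradient(f, z)
    f_next = [fi - gamma * gi for fi, gi in zip(f, g)]  # update (8), no projection
    direct = loss(f, z) - loss(f_next, z)
    taylor = gamma * dot(g, g) - 0.5 * gamma ** 2 * hessian_form(z, g)
    return direct, taylor


f_n = [0.2, -0.1]
z_n = ([1.0, 2.0], 0.7)
M = 2.0 * dot(z_n[0], z_n[0])  # operator norm of the Hessian 2 x x^T
direct, taylor = one_step_decrease(f_n, z_n, gamma=0.05)  # 0.05 <= 2 / M
print(direct, taylor)
```

With a step larger than $2/M$ (for instance `gamma=0.3` here) the same computation returns a negative decrease, which is why the sign condition in (6) is only claimed for $n$ large enough.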

A partial converse result is given by the following theorem.
Theorem 3. Assume that,

• There exists $f_K \in K$ such that $\nabla I(f_K) = 0$, and for all $f \in \mathcal{H}$, $f \ne f_K$, $\langle f - f_K, \nabla I(f) \rangle > 0$.

• There exist $C, N > 0$ such that for all $n > N$, (6) holds with $\beta_n = C \gamma_n$.
Then,

$$P\left(\lim_{n \to \infty} \|f_n - f_K\| = 0\right) = 1. \tag{13}$$

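Proof. Note that from (11) we also have

$$E_{z_n}[V(f_{n+1}, z_n) - V(f_n, z_n) \mid S_n] = -\gamma_n E_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n] + \tfrac{1}{2} \gamma_n^2 E_{z_n}[\langle \nabla V(f_n, z_n), H(V(f, z_n)) \nabla V(f_n, z_n) \rangle \mid S_n],$$

so that using the stability assumption and (10) we obtain

$$-\beta_n \le \left(\tfrac{M}{2} \gamma_n^2 - \gamma_n\right) E_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n],$$

that is, for $n$ large enough that $\gamma_n < 2/M$,

$$E_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n] \le \frac{\beta_n}{\gamma_n - \frac{M}{2} \gamma_n^2} = \frac{C \gamma_n}{\gamma_n - \frac{M}{2} \gamma_n^2}.$$

From Lemma 1, for $n$ large enough we obtain

$$\|f_{n+1} - f_K\|^2 \le \|f_n - f_K\|^2 + \gamma_n^2 \|\nabla V(f_n, z_n)\|^2 - 2 \gamma_n \langle f_n - f_K, \nabla V(f_n, z_n) \rangle,$$

so that taking the expectation w.r.t. $z_n$ conditioned on $S_n$ and using the assumptions, we write

$$\begin{aligned}
E_{z_n}[\|f_{n+1} - f_K\|^2 \mid S_n] &\le \|f_n - f_K\|^2 + \gamma_n^2 E_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n] - 2 \gamma_n \langle f_n - f_K, E_{z_n}[\nabla V(f_n, z_n) \mid S_n] \rangle \\
&\le \|f_n - f_K\|^2 + \gamma_n^2 \frac{C \gamma_n}{\gamma_n - \frac{M}{2} \gamma_n^2} - 2 \gamma_n \langle f_n - f_K, \nabla I(f_n) \rangle,
\end{aligned}$$

since $E_{z_n}[\nabla V(f_n, z_n) \mid S_n] = \nabla I(f_n)$. The series $\sum_n \gamma_n^2 \frac{C \gamma_n}{\gamma_n - \frac{M}{2} \gamma_n^2}$ converges and the last inner
product is positive by assumption, so that the Robbins-Siegmund theorem implies (13) and the
theorem is proved. □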



A Remarks: assumptions 

• The assumptions will be satisfied if the loss is convex (and twice differentiable) and $\mathcal{H}$ is
compact. In fact, a convex function is always locally Lipschitz, so that if we restrict $\mathcal{H}$ to be
a compact set, $V$ satisfies (5) for

$$L = \sup_{f \in \mathcal{H},\, z \in Z} \|\nabla V(f, z)\| < \infty.$$

Similarly, since $V$ is twice differentiable and convex, the Hessian $H(V(f, z))$ of
$V$ at any $f \in \mathcal{H}$ and $z \in Z$ is identified with a bounded, positive semi-definite matrix, that
is

$$\langle f, H(V(f, z)) f \rangle \ge 0, \qquad \|H(V(f, z))\| \le 1 < \infty,$$

for any $f \in \mathcal{H}$ and $z \in Z$, where for the sake of simplicity we took the bound on the Hessian
to be 1.

• The gradient in the SGD update rule can be replaced by a stochastic subgradient with little
change to the theorems.



B Learning Rates, Finite Sample Bounds and Complexity 

B.1 Connections Between Different Notions of Convergence.

It is known that both convergence in expectation and strong convergence imply weak convergence.
On the other hand, if we have weak consistency and

$$\sum_{n=1}^{\infty} P(I(f_n) - I(f_K) > \epsilon) < \infty$$

for all $\epsilon > 0$, then weak consistency implies strong consistency by the Borel-Cantelli lemma.
B.2 Rates and Finite Sample Bounds.

A stronger result is weak convergence with a rate, that is

$$P(I(f_n) - I(f_K) > \epsilon) \le \delta(n, \epsilon),$$

where $\delta(n, \epsilon)$ decreases in $n$ for all $\epsilon > 0$. We make two observations. First, one can see that
the Borel-Cantelli lemma imposes a rate on the decay of $\delta(n, \epsilon)$. Second, typically $\delta = \delta(n, \epsilon)$ is
invertible in $\epsilon$, so that we can write the above result as a finite sample bound

$$P(I(f_n) - I(f_K) \le \epsilon(n, \delta)) \ge 1 - \delta.$$

B.3 Complexity and Generalization

We say that a class of real valued functions $\mathcal{F}$ on $Z$ is uniform Glivenko-Cantelli (UGC) if

$$\lim_{n \to \infty} \sup_{\rho} P\left(\sup_{F \in \mathcal{F}} |E_n F - E F| > \epsilon\right) = 0$$

for all $\epsilon > 0$. If we consider the class of functions induced by $V$ and $\mathcal{H}$, that is $F(\cdot) = V(f, \cdot)$,
$f \in \mathcal{H}$, the above property can be written as

$$\lim_{n \to \infty} \sup_{\rho} P\left(\sup_{f \in \mathcal{H}} |I_n(f) - I(f)| > \epsilon\right) = 0. \tag{14}$$

Clearly the above property implies (4), hence consistency of ERM if the minimizer $f_{\mathcal{H}}$ of $I$ over $\mathcal{H}$
exists and under mild assumptions on the loss (see footnote 1).

It is well known that UGC classes can be completely characterized by suitable capacity/complexity
measures of $\mathcal{H}$. In particular a class of binary valued functions is UGC if and only if the
VC-dimension is finite. Similarly a class of bounded functions is UGC if and only if the fat-shattering
dimension is finite. See [?] and references therein.

Finite complexity of $\mathcal{H}$ is hence a sufficient condition for the consistency of ERM.

B.4 Necessary Conditions

One natural question is whether the above conditions are also necessary for consistency of ERM
in the sense of (2), or in other words whether consistency of ERM on $\mathcal{H}$ implies that $\mathcal{H}$ is a UGC class.

An argument in this direction is given by Vapnik, who calls the result (together with the
converse direction) the key theorem in learning. Vapnik argues that (2) must be replaced by a much
stronger notion of convergence, essentially holding if we replace $\mathcal{H}$ with $\mathcal{H}_\gamma = \{f \in \mathcal{H} \mid I(f) \ge \gamma\}$,
for all $\gamma$.

Another result in this direction is given without proof in [1].
B.5 Robbins-Siegmund's Lemma

We use the stochastic approximation framework described by Duflo ([3], pp. 6-15).

We assume a sequence of data $z_i$ defined on a probability space $(\Omega, \mathcal{A}, P)$ and a filtration $\mathbb{F} =
(\mathcal{F}_n)_{n \in \mathbb{N}}$, where $\mathcal{F}_n$ is a $\sigma$-field and $\mathcal{F}_n \subset \mathcal{F}_{n+1}$. In addition, a sequence $X_n$ of measurable functions
from $(\Omega, \mathcal{A})$ to another measurable space is said to be adapted to $\mathbb{F}$ if for all $n$, $X_n$ is
$\mathcal{F}_n$-measurable.

Definition. Suppose that $X = (X_n)$ is a sequence of random variables adapted to the filtration
$\mathbb{F}$. $X$ is a supermartingale if it is integrable (see [3]) and if

$$E[X_{n+1} \mid \mathcal{F}_n] \le X_n.$$

The following is a key theorem ([3]).

Theorem B.1. (Robbins-Siegmund) Let $(\Omega, \mathcal{F}, P)$ be a probability space and let
$\mathcal{F}_1 \subset \cdots \subset \mathcal{F}_n \subset \cdots$ be a sequence of sub-$\sigma$-algebras of $\mathcal{F}$. Let $(V_n)$, $(\beta_n)$, $(\chi_n)$, $(\eta_n)$
be finite non-negative $\mathcal{F}_n$-measurable random variables, that is, four non-negative sequences adapted to
$\mathbb{F}$, and suppose that

$$E[V_{n+1} \mid \mathcal{F}_n] \le V_n (1 + \beta_n) + \chi_n - \eta_n.$$

Then, if $\sum \beta_n < \infty$ and $\sum \chi_n < \infty$, almost surely $(V_n)$ converges to a finite random variable
and the series $\sum \eta_n$ converges.

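We provide a short proof of a special case of the theorem.

Theorem B.2. Suppose that $(V_n)$ and $(\eta_n)$ are positive sequences adapted to $\mathbb{F}$ and that

$$E[V_{n+1} \mid \mathcal{F}_n] \le V_n - \eta_n.$$

Then almost surely $(V_n)$ converges to a finite random variable and the series $\sum \eta_n$ converges.
Proof.

Let $Y_n = V_n + \sum_{k=1}^{n-1} \eta_k$. Then we have

$$E[Y_{n+1} \mid \mathcal{F}_n] = E[V_{n+1} \mid \mathcal{F}_n] + \sum_{k=1}^{n} E[\eta_k \mid \mathcal{F}_n] \le V_n - \eta_n + \sum_{k=1}^{n} \eta_k = Y_n.$$

So $(Y_n)$ is a supermartingale, and because $(V_n)$ and $(\eta_n)$ are positive sequences, $(Y_n)$ is also
bounded from below by 0, which implies it converges almost surely. The partial sums of $\sum \eta_n$ are
non-decreasing and bounded by $Y_n$, so they converge as well. It follows that both $(V_n)$ and
$\sum \eta_n$ converge.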